130 research outputs found

    Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus

    In the Low Countries, a major reference corpus for written Dutch is being built. We discuss the interplay between data acquisition and data processing during the creation of the SoNaR Corpus. Based on developments in traditional corpus compiling and new web harvesting approaches, SoNaR is designed to contain 500 million words, balanced over 36 text types including both traditional and new media texts. Besides its balanced design, every text sample included in SoNaR will have its IPR issues settled to the largest extent possible. This data collection task presents many challenges because every decision taken on the level of text acquisition has ramifications for the level of processing and the general usability of the corpus. As far as the traditional text types are concerned, each text brings its own processing requirements and issues. For new media texts (SMS, chat) the problem is even more complex: issues such as anonymity, recognizability and citation right all present problems that have to be tackled. The solutions actually lead to the creation of two corpora: a gigaword SoNaR, IPR-cleared for research purposes, and the smaller, of commissioned size, more privacy-compliant SoNaR, IPR-cleared for commercial purposes as well.

    QUINE corpus in Autosearch

    The QUINE corpus (version 0.5) consists of virtually all of Quine’s 228 books and articles, containing in total 819 documents (books are split into parts), 2,150,356 word tokens, 38,791 word types and 27,837 lemmatized word types. It includes texts in various genres and from different phases of Quine’s thought on various topics, including technical and formula-heavy writings on logic and the foundations of mathematics. The corpus exhibits a high degree of lexical variation and many instances of fine-grained meaning distinctions.

    KANT corpus in Autosearch

    Works 1-12 of the 'Gesammelten Werke' of the philosopher Immanuel Kant, online and available to researchers by invitation in a corpus exploration and exploitation interface based on WhiteLab, by virtue of Autosearch, a CLARIN service provided by INT, the Institute for the Dutch Language.

    OCR Post-Correction Evaluation of Early Dutch Books Online - Revisited

    Publisher's version (Open Access). Tenth International Conference on Language Resources and Evaluation (LREC 2016).

    Non-interactive ocr post-correction for giga-scale digitization projects

    This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, contemporary and historical. Text-Induced Corpus Clean-up, or TICCL (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants for any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many as possible of the true positives and to discard as many as possible of the false positives. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to properly evaluate our system, but also to draw effective conclusions towards the adaptation of the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of undesirable OCR-induced typographical variation present can be removed fully automatically.
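
The core idea in the abstract above (pick high-frequency focus words from the corpus itself, then collect rarer word forms within a small Levenshtein distance) can be illustrated with a naive sketch. This is not the TICCL implementation, whose internals the abstract does not describe. The function names, the `min_focus_freq` threshold, the rule that a variant must be rarer than its focus word, and the toy token list are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each frequent word to the rarer word forms within max_ld edits.

    Hypothetical helper: the frequency threshold and the "variant must be
    rarer than the focus word" filter are illustrative assumptions, standing
    in for the unspecified text-induced filters mentioned in the abstract.
    """
    counts = Counter(tokens)
    focus_words = [w for w, c in counts.items() if c >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        variants[focus] = sorted(
            w for w, c in counts.items()
            if w != focus
            and c < counts[focus]
            and levenshtein(focus, w) <= max_ld
        )
    return variants


# Toy corpus with invented OCR-style confusions (e.g. "in" read as "m").
tokens = (["regering"] * 5
          + ["regermg", "rcgering"]
          + ["koning"] * 3
          + ["koniug", "de"])
result = gather_variants(tokens)
print(result)
```

This pairwise comparison is quadratic in vocabulary size, so it only suits small examples; a giga-scale system would need a far more efficient candidate lookup. Short focus words would also attract many false positives at distance 2, which is where additional filtering becomes important.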
